[57] ABSTRACT

The present invention is a fully connected feed forward 
network that includes at least one hidden layer 16. The 
hidden layer 16 includes nodes 20 in which the output of the 
node is fed back to that node as an input with a unit delay 
produced by a delay device 24 occurring in the feedback 
path 22 (local feedback). Each node within each layer also 
receives a delayed output (crosstalk) produced by a delay 
unit 36 from all the other nodes within the same layer 16. 
The node performs a transfer function operation based on 
the inputs from the previous layer and the delayed outputs. 
The network can be implemented as analog or digital or 
within a general purpose processor. Two teaching methods 
can be used: (1) back propagation of weight calculation that
includes the local feedback and the crosstalk or (2) more
preferably a feed forward gradient descent which immediately
follows the output computations and which also
includes the local feedback and the crosstalk. Subsequent to
the gradient propagation, the weights can be normalized, 
thereby preventing convergence to a local optimum. Edu- 
cation of the network can be incremental both on and 
off-line. An educated network is suitable for modeling and 
controlling dynamic nonlinear systems and time series sys- 
tems and predicting the outputs as well as hidden states and 
parameters. The educated network can also be further edu- 
cated during on-line processing. 

BACKGROUND OF THE INVENTION

1. Field of the Invention 

The present invention is directed to a neural node network
with local feedback and crosstalk along with a method of
teaching same and, more particularly, to a feed forward
neural network with local node feedback and crosstalk
between nodes within a layer, in which the learning method
includes feed forward weight modification along with
propagation of the local feedback and crosstalk during the weight
modification, with the network being used to develop a
model of an actual system, where the model can be used to
determine hidden states or parameters of the actual system
or expected outputs of the actual system to possible input
changes.

2. Description of the Related Art 

Conventional perceptron based neural networks include
input and output layers and hidden layers between the input
and output layers. The hidden layers are fully connected, that
is, each node in a hidden layer is connected to all the nodes
in a prior layer and to all nodes in a subsequent layer. Each
of the nodes conventionally sums weighted inputs to the
node and then performs a transfer function operation such as
a threshold comparison operation to produce an output to the
next layer. The transfer function is sometimes called a
transform function, an activation function, a gain function,
a squashing function, a threshold function, or a sigmoid
function. Since the introduction of recurrent neural networks
by Hopfield a number of researchers have considered
various architectures and learning algorithms. The most
prominent among these are (1) the real time recurrent learning
which uses a purely feedback network, (2) back propagation
through time which uses a purely feed forward network, (3)
recurrent back propagation trained to recognize fixed points,
(4) the use of the previous approach to learning the trajectory
of unforced systems, (5) dynamic back propagation and (6)
the grouping of feedback links as nodes of a feed forward
network. None of these architectures provides a suitable
network for modeling and controlling dynamic nonlinear
systems and none of the learning methods are particularly
efficient at converging to an optimum solution for dynamic
nonlinear systems.

SUMMARY OF THE INVENTION

It is an object of the present invention to provide a feed 
forward neural network with local recurrency and intralayer 
recurrency. 

It is also an object of the present invention to produce a
network that can model dynamic nonlinear systems to
predict system outputs and hidden states or parameters.

It is an additional object of the present invention to 
provide a learning method that educates the network using 
the local and intralayer recurrency. 

It is a further object of the present invention to provide a 
learning method that improves learning speed by feed for- 
ward gradient modification trailing one layer behind output 
computations or within the layer computations but subse- 
quent to the output computations. 

It is still another object of the present invention to provide
a learning method that ensures global optimization by
avoiding local optima through weight normalization via zeroth
and/or higher order moments of the error gradients.

It is a further object of the present invention to provide a 
modeling method capable of modeling and controlling non- 
linear dynamic systems and time series systems, thereby 
allowing the simulation of dynamic nonlinear system 
responses and the determination of hidden states and param- 
eters for such systems. 

The above objects can be attained by a network that 
includes fully connected hidden layers. Each hidden layer 
includes nodes in which the output of the node is fed back 
to that node as an input with a unit delay occurring in the 
feedback path (local feedback). Each node within each layer 
also receives a delayed output (crosstalk) from all the other 
nodes within the same layer. During the teaching phase, 
back propagation of the weights using a gradient descent
method that includes the local feedback and the crosstalk
can be performed. The teaching phase can also utilize a feed
forward gradient descent method which follows the output
computations and that also includes the local feedback and 
the crosstalk. Subsequent to the weight propagation, the 
weights can be normalized, thereby preventing entrapment 
in a local optimum. An educated network is suitable for 
modeling dynamic nonlinear systems and time series sys- 
tems and is capable of predicting system outputs as well as 
hidden states and parameters. Furthermore an educated 
network can be used to control a non-linear dynamic system. 
The educated network can also be further educated during 
on-line processing for prediction of system outputs, hidden 
state or parameters, as well as for control of dynamic 
nonlinear systems. 

These together with other objects and advantages which 
will be subsequently apparent, reside in the details of 
construction and operation as more fully hereinafter 
described and claimed, reference being had to the accom- 
panying drawings forming a part hereof, wherein like 
numerals refer to like parts throughout. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 illustrates a network in accordance with the present 
invention; 

FIG. 2 illustrates the operations of a node in accordance
with the present invention;

FIG. 3 illustrates an analog implementation of a network 
in accordance with the present invention; 

FIGS. 4 and 5 illustrate components of the network of 
FIG. 3 for continuous time implementation; 

FIG. 6 illustrates components of the network of FIG. 3 for 
discrete time implementation; 

FIG. 7 illustrates a digital implementation of the present
invention using discrete components;

FIGS. 8A and 8B illustrate a first embodiment of an 
implementation of the present invention in a general purpose 
computer; 

FIGS. 9A-9C illustrate a second embodiment implemented
in a general purpose computer;

FIG. 10 illustrates a third embodiment of the present 
invention which can be applied to the embodiments in FIGS. 
8 and 9; 

FIG. 11 illustrates the use of the present invention to 
model a dynamic nonlinear system such as a process control 
plant; 

FIGS. 12-15 illustrate the results of using the present 
invention to model a multiple input multiple output dynamic 
nonlinear system; 

FIG. 16 illustrates using the present invention to deter- 
mine hidden or unknown states and parameters of a system; 

FIG. 17 illustrates using the present invention to model
a time series system; and

FIG. 18 illustrates a control system using the present 
invention. 

DESCRIPTION OF THE PREFERRED
EMBODIMENTS 

The present invention, as illustrated in FIG. 1, comprises
a feed forward multilayer perceptron network 10 augmented
with (locally) recurrent (node output is fed back to itself
after a unit delay) and crosstalk (node output is fed to other
nodes of the same layer after a unit delay) feedback links.
The delay links allow the nodes to maintain past information
or memory and process the memory along with current
inputs, thereby allowing the network to model temporal
nonlinearities.

This network 10 includes an input layer 12 with input 
nodes 14 which act as buffers or storage elements. The 
network 10 also preferably includes two hidden layers 16 
and 18, although a single hidden layer or more than two 
hidden layers could be provided. 

Each hidden layer includes recurrent crosstalk nodes 20. 
These nodes 20 each have a local feedback loop 22 with a 
unit delay 24 in the feedback path. Each node 20 also 
includes crosstalk links 26, 28 and 30 from the nodes within 
that layer. Each of the crosstalk links also has a unit delay 
32, 34 and 36. This results in connection of each node within 
each layer to every other node within that layer. Additional 
details with respect to the node will be described with 
respect to FIG. 2. 

The first hidden layer 16 is fully connected to the input
layer 12, the second hidden layer 18 is fully connected to the
first hidden layer 16 and an output layer 38 is fully
connected to the last hidden layer 18. The hidden layers 16 and
18 can include more or fewer nodes than each of the input 12
and output 38 layers, and it is preferred that the hidden layers
not have the same number of nodes as the input or output
layers and that the number of nodes in each subsequent hidden
layer be less than in the preceding layer. In a typical
implementation the number of nodes 14 in the input layer 12
equals the number of inputs to a system being modeled and
the number of output nodes 40 in the output layer 38 equals
the number of outputs in the system being modeled.

Output nodes 40 of the output layer 38 can each be a 
simple summing buffer, or perform the transfer function of 
the hidden layer nodes 20 without the local feedback and 
crosstalk, or perform a transfer function different from the 
hidden layer also without the feedback and crosstalk. 

The basic elements of a node 20 (or 40) in accordance
with the present invention, as illustrated in FIG. 2, are a
summing unit 50 and a transfer function unit 52. The
summing unit 50 performs a conventional sum of all its
inputs while the transfer function unit 52 performs a
discriminatory function and can be any of the transfer functions
used in neural networks such as linear, saturation,
sigmoid, hyperbolic tangent, or any other function. The
sigmoid and hyperbolic tangent transfer functions are
preferred for the hidden layers and the linear and saturation
functions are preferred for the output layer 38; however,
each hidden layer can perform a different transfer function.
Each node also includes weight multiplication units 54-64
which multiply the incoming signals by a weight which is
adjusted during the learning process. The output of the
transfer function unit 52 is fed back through a unit delay
device 66 before being multiplied by the weight
multiplication unit 54. The weight units 56-60 receive inputs from the
previous layer in the network while the weight units 62-64
receive inputs from the other nodes in the layer (the
crosstalk) through unit delay devices 68 and 70. Although a
single time unit delay is preferred in both the feedback and
crosstalk paths, the time delay could be more than one time
unit and the delays in the feedback and crosstalk paths need
not be equal.
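
As a rough illustration of the node of FIG. 2, the following Python sketch computes one node output from the previous-layer inputs, the node's own delayed output and the delayed outputs of the other nodes in the layer. The function name `node_output`, its argument names, the optional `bias` term and the choice of `math.tanh` as the transfer function are assumptions made for this example only.

```python
import math

def node_output(prev_inputs, w_in, own_delayed, w_self,
                others_delayed, w_cross, bias=0.0, transfer=math.tanh):
    """One time step of a recurrent crosstalk node (illustrative sketch)."""
    # Weighted inputs from the previous layer (weight units 56-60).
    total = sum(w * x for w, x in zip(w_in, prev_inputs))
    # Local feedback: the node's own output after a unit delay (weight unit 54).
    total += w_self * own_delayed
    # Crosstalk: delayed outputs of the other nodes in the layer (units 62-64).
    total += sum(w * x for w, x in zip(w_cross, others_delayed))
    # Summing unit 50 followed by transfer function unit 52.
    return transfer(total + bias)

# Usage: the delayed outputs start at zero, then the first output is fed back.
y0 = node_output([0.5, -0.2], [0.3, 0.8], 0.0, 0.5, [0.0], [0.1])
y1 = node_output([0.5, -0.2], [0.3, 0.8], y0, 0.5, [0.0], [0.1])
```

The second call differs from the first only in the delayed value `y0`, which is how the node carries memory from one time step to the next.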

FIG. 3 illustrates an analog embodiment of the present
invention in which a single hidden layer 80 is used. The
inputs to the network 78 are weighted by weighting resistors
82, while the output layer 84 includes transfer function
amplifier units 86 and weight resistors 88. The hidden layer
80 includes transfer function amplifier units 90, delay units
92, feedback weight resistors 94 and crosstalk weight
resistors 96. A typical delay in an analog device would be
approximately 10 μsecs. The details of the delay unit 92 for
a continuous time analog network are illustrated in FIG. 4,
and the details of the transfer function units 86 and 90 are
illustrated in FIG. 5.

The delay unit 92 (FIG. 4) for the continuous time analog
implementation includes conventional amplifiers 110 and
112 as well as a resistor 114 and a capacitor 116. The transfer
function amplifier units 86 and 90 (FIG. 5) include an
operational amplifier 120, resistors 122-126 and a capacitor
128. The relationship between the resistors and capacitors in
the unit delay 90 should result in $R_2C_2$ being much, much
less than $R_1C_1$.

FIG. 6 illustrates a discrete time analog implementation of
the nodes of the network 10. The device illustrated in FIG.
6 implements not only the transfer function of the network
of FIG. 3 and the local and crosstalk delay, but also
the input weights provided by resistors 82 in FIG. 3 and the
crosstalk and feedback weights provided by resistors 94 and
96. This device includes switching transistors 140-148
controlled by a first phase signal $\phi_1$ which, as shown, is 180°
out of phase with a second phase signal $\phi_2$ which
controls switching transistors 150-156. The output of an
operational amplifier 158 is fed back through capacitors 160
and 162, where one of the inputs from the prior layers, the
recurrent input or one of the crosstalk inputs, is weight
adjusted by capacitor 164 when the weight is negative and
is adjusted by the capacitor 166 when the weight is positive.
The weight value is $-\alpha_1$ in the negative weight situation and
is $\alpha_1$ in the positive weight situation.

A discrete component digital implementation of each
node is illustrated in FIG. 7. This node can be implemented
with plural nodes on a single chip using conventional very
large scale integrated circuit technology. In this
embodiment, the transfer function operation is performed by a
threshold lookup table 182 which would be implemented by
a read only memory (ROM) that performs a conventional
transformation of its input to an output using a lookup
operation. The sum function of the node is performed by a
conventional accumulator 184, while the weight
multiplication of the inputs is performed by a conventional multiplier



5,479,571 



186 receiving the inputs to the node and multiplying the
inputs times a weight output by a weight ROM 188. The
selection of weights as well as the selection of inputs to be
multiplied is performed by a cyclic counter 190 and a
multiplexer 192. When the end of the count has been reached
and the count is reset, the cyclic counter 190 outputs a clear
signal that clears the accumulator 184. The output of the
node is stored in an output register 194 and fed back through
a feedback delay register 196 connected to the multiplexer
192. The delay in the crosstalk connections is provided by
crosstalk delay registers 198-200. Because the delay is
provided by registers, the delay is equal to the time between
samples of the input signals to the network. These input and
output signals to the node are preferably eight bits. The
embodiment illustrated in FIG. 7 requires that each node be
connected to all the other nodes by individual busses,
which can present significant interconnection problems on
integrated circuit chips and also limits the number of nodes
that can be formed on a single chip. This problem can be
overcome by connecting the nodes to a bus that connects
between two adjacent layers. The input to each node would
include a demultiplexer connected to the bus. The
demultiplexer would be connected to input registers which would
store the inputs from the previous layer as well as the
crosstalk links. Such an embodiment would also require an
input multiplexer control counter for controlling the
selection of the appropriate input storage register in accordance
with the node output on the bus. Such an embodiment would
also be provided with an output gate for output register 194
which would output contents of the output register 194 onto
the bus for the subsequent layer, when the count in the input
control counter equals a certain value, thereby placing the
output onto the bus for the next layer at the appropriate
timing.

FIGS. 8A and 8B comprise a flowchart of the network of
FIG. 1, as well as the learning method, being implemented
on a single general purpose processor. It is preferred that a
computer, such as the VAX 9000 available from Digital
Equipment Corporation which includes a vector processor,
be used and that the FORTRAN or “C” languages be used
to implement this flowchart. The process essentially
performs a multilayer feed forward recurrent transfer operation
with static back propagation recurrent gradient descent
learning.

The system starts by randomly generating 222 the weights
and the bias for each of the nodes followed by initialization
224 of an iteration counter as well as initialization 226 of the
outputs to zero. Next, the system reads 228 the entire
training input sample set. That is, if the network includes 10
input nodes and there are 100 input samples or steps where
each set includes 10 actual inputs, then 100 time sample sets
of inputs would be read for a total of 1000 inputs. The
system then initializes 230 a time or set counter to one,
initializes an error to zero 232 and then initializes 234 a layer
counter. The system then enters a loop which performs the
transfer function for all of the nodes in all of the layers. The
first step in this loop is to increment the layer counter. The
system then performs 238 the transfer function for all of the
nodes in the layer in accordance with the equations (1)-(3):

$$X_{[1,i]}(k) = U_i(k), \quad \text{for } i = 1, \ldots, N(1) \text{ and } k = 1, \ldots, T \tag{1}$$

$$Z_{[l,i]}(k) = \sum_{j=1}^{N(l)} W_{[l,j][l,i]} X_{[l,j]}(k-1) + \sum_{j=1}^{N(l-1)} W_{[l-1,j][l,i]} X_{[l-1,j]}(k) + b_{[l,i]}, \quad \text{for } i = 1, \ldots, N(l) \tag{2}$$

$$X_{[l,i]}(k) = F_{[l]}(Z_{[l,i]}(k)), \quad \text{for } i = 1, \ldots, N(l) \tag{3}$$

where k is the time step, L is the total number of layers, l is
the layer number, N(l) is the number of nodes in layer l, i is
a pointer variable, j is a pointer variable, $U_i(k)$ is the ith input
at time step k, $X_{[l,i]}(k)$ is the output of node i of layer l at time
step k, $W_{[l,j][l',i]}$ is the weight applied to the signal transferred
from the node j of layer l to node i of layer l', where l' can
equal l when local feedback or crosstalk occurs, $b_{[l,i]}$ is the
bias for node i of layer l and defines the region where each
node is active, $Z_{[l,i]}(k)$ is a temporary variable storing the
output of the summation of the weight times the inputs, and
$F_{[l]}$ is the discriminatory transfer function for the nodes of
layer l.

Once the transfer function outputs for all the nodes in the
network have been obtained using one of the training sets,
the system enters a learning phase. The learning phase
involves a backward pass through the network during which
the output layer error gradient signals are used to back
propagate the error gradients for each node of the hidden
layers. Then the network node weights and biases are
updated. The transfer function output calculation, backward
pass and update are then performed for the remaining training
sets. The learning phase steps are discussed in detail below.

The system first determines 240 whether the last layer has
been performed and then updates an error in accordance
with equation (4).

$$E(\text{new value}) = E(\text{old value}) + \frac{1}{2} \sum_{i=1}^{N(L)} \left( X_{[L,i]}(k) - Y_i(k) \right)^2 \tag{4}$$

where E is the error and $Y_i(k)$ is the target output at the step
k. The system then calculates 244 the gradients for the
output layer and then calculates 246 the gradients for the
hidden layers in accordance with equations (5) and (6).

$$\partial E(k)/\partial X_{[L,i]} = X_{[L,i]}(k) - Y_i(k), \quad \text{for } i = 1, \ldots, N(L) \tag{5}$$

$$\partial E(k)/\partial X_{[l,j]} = \sum_{i=1}^{N(l+1)} W_{[l,j][l+1,i]} F'_{[l+1]}(Z_{[l+1,i]}(k)) \, \partial E(k)/\partial X_{[l+1,i]}, \quad \text{for } j = 1, \ldots, N(l) \tag{6}$$

where $\partial E(k)/\partial X_{[l,i]}$ is the gradient and $F'_{[l]}$ is the derivative
of $F_{[l]}$.

Next, the system updates 248 weights for the feed forward
connections in accordance with equation (7) as well as the
local feedback and crosstalk connections in accordance with
equation (8) and also updates the bias in accordance with
equation (9).

$$W_{[l-1,j][l,i]} = W_{[l-1,j][l,i]} - \eta \, (\partial E(k)/\partial X_{[l,i]}) \, F'_{[l]}(Z_{[l,i]}(k)) \, X_{[l-1,j]}(k), \quad \text{for all } l, i \text{ and } j \tag{7}$$

$$W_{[l,j][l,i]} = W_{[l,j][l,i]} - \eta \, (\partial E(k)/\partial X_{[l,i]}) \, F'_{[l]}(Z_{[l,i]}(k)) \, X_{[l,j]}(k-1), \quad \text{for all } l, i \text{ and } j \tag{8}$$

$$b_{[l,i]} = b_{[l,i]} - \eta \, (\partial E(k)/\partial X_{[l,i]}) \, F'_{[l]}(Z_{[l,i]}(k)), \quad \text{for all } l \text{ and } i \tag{9}$$

where $\eta$ is the learning rate which is preferably set at 0.005
to 0.1.

This having been done the system then increments 250 the
time step counter and determines 252 whether the end of the
time period has been reached. If the end of the time period
has not been reached, the system cycles back to continue
processing. If the end of the time period has been reached the
iteration counter is incremented 254 and then the error is
compared 256 to an error tolerance where it is preferred
that the error tolerance be approximately from 0.1% to 1%. If
the error has been reduced below the tolerance the system
stops. Otherwise a determination 260 is made as to whether
the iteration count is greater than the maximum iteration
count, where the maximum number of iterations is
preferably one-hundred thousand. If the maximum number of
iterations has been performed the system stops; otherwise
the system returns to perform another iteration.
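
The static learning loop described above, covering equations (1)-(9) and the iteration control, can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the flowchart of FIGS. 8A and 8B itself: the names `make_net`, `forward`, `train_step` and `train` are hypothetical, `math.tanh` is assumed as the transfer function of every layer, equation (2) is applied uniformly to all non-input layers for brevity, and the tolerance, iteration limit and training data are arbitrary example values.

```python
import math
import random

def make_net(sizes, seed=0):
    """Random weights and biases for layers 1..L-1 (layer 0 is the input)."""
    rng = random.Random(seed)
    r = lambda rows, cols: [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                            for _ in range(rows)]
    net = {"sizes": sizes, "Wf": [None], "Wr": [None], "b": [None]}
    for l in range(1, len(sizes)):
        net["Wf"].append(r(sizes[l], sizes[l - 1]))   # feed forward weights
        net["Wr"].append(r(sizes[l], sizes[l]))       # feedback and crosstalk
        net["b"].append([rng.uniform(-0.5, 0.5) for _ in range(sizes[l])])
    return net

def forward(net, u, x_prev):
    """Equations (1)-(3): returns layer outputs X and summations Z."""
    X, Z = [list(u)], [None]
    for l in range(1, len(net["sizes"])):
        z = [sum(wr * xp for wr, xp in zip(net["Wr"][l][i], x_prev[l]))
             + sum(wf * x for wf, x in zip(net["Wf"][l][i], X[l - 1]))
             + net["b"][l][i]
             for i in range(net["sizes"][l])]
        Z.append(z)
        X.append([math.tanh(v) for v in z])
    return X, Z

def train_step(net, u, y, x_prev, eta):
    """Equations (4)-(9) for one time step; returns (error term, outputs)."""
    X, Z = forward(net, u, x_prev)
    last = len(net["sizes"]) - 1
    grad = [xi - yi for xi, yi in zip(X[last], y)]            # equation (5)
    err = 0.5 * sum(g * g for g in grad)                      # equation (4)
    for l in range(last, 0, -1):
        delta = [g * (1.0 - math.tanh(z) ** 2) for g, z in zip(grad, Z[l])]
        # Equation (6), evaluated with the weights before they are updated.
        grad = [sum(net["Wf"][l][i][j] * delta[i] for i in range(len(delta)))
                for j in range(net["sizes"][l - 1])]
        for i, d in enumerate(delta):
            for j in range(net["sizes"][l - 1]):
                net["Wf"][l][i][j] -= eta * d * X[l - 1][j]   # equation (7)
            for j in range(net["sizes"][l]):
                net["Wr"][l][i][j] -= eta * d * x_prev[l][j]  # equation (8)
            net["b"][l][i] -= eta * d                         # equation (9)
    return err, X

def train(net, inputs, targets, eta=0.05, tol=1e-3, max_iter=2000):
    """Repeat passes over the sequence until the error or iteration limit."""
    err = float("inf")
    for iteration in range(1, max_iter + 1):
        x_prev = [[0.0] * n for n in net["sizes"]]   # delayed outputs start at 0
        err = 0.0
        for u, y in zip(inputs, targets):
            step_err, x_prev = train_step(net, u, y, x_prev, eta)
            err += step_err
        if err < tol:
            break
    return err, iteration

# Usage on a made-up sequence whose target depends on the previous input.
us = [[0.5], [-0.3], [0.8], [0.1], [-0.6], [0.4], [0.0], [0.7]]
ys = [[0.4 * us[k][0] + 0.2 * (us[k - 1][0] if k else 0.0)]
      for k in range(len(us))]
net = make_net([1, 3, 1])
first_err, _ = train(net, us, ys, max_iter=1)
final_err, iters = train(net, us, ys, max_iter=500)
```

Because the gradients are computed only from the current time step, the delayed outputs `x_prev` are treated as constants during the update, which is what makes this the static form of the method.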

Once the system has completed the learning phase using 
the training sets, the processor can be connected up to 
receive actual real time inputs and will predict the outputs of 
the actual system based on those inputs. When the system is 
executed in real time after the off-line learning has occurred, 
if real time learning is desired, only the steps in blocks 228, 
236, 238, 240, 244, 246 and 248 are performed. If the system 
is not intended to continue learning only the functions in 
steps 228, 236, 238 and 240 are performed. 

The embodiment illustrated in FIGS. 8A and 8B is what 
is called a static embodiment in which all of the outputs of 
the nodes are calculated before the gradients are calculated 
and before the new weights for the nodes are also calculated. 
In an alternate and more preferred embodiment the error 
gradients are held over to the next step, so that past values 
of the error gradients are accounted for to allow dynamic 
learning. In this alternate embodiment the invention forward 
propagates the gradients and as a result can perform the 
gradient calculation for a layer of nodes immediately after 
performing the output calculations for the nodes of that 
layer. This embodiment thus essentially performs a multi- 
layer recurrent transfer operation with dynamic recurrent 
gradient descent learning. This process is illustrated in the 
embodiment of FIGS. 9A-9C. 

As in the previous embodiment, the present invention 
randomly generates 282 the weights and the biases for each 
node in each layer and initializes 284 an iteration counter. 
Instead of just initializing the outputs to zero the present 
invention also initializes 286 the gradients to zero. When in 
the learning mode, the system, as in the previous embodi- 
ment, also reads 288 the input set followed by initializing 
290 the time counter and initializing 300 the error. Next the 
system initializes 302 the layer counter, increments 303 the 
counter and then executes 304 the transfer function for all 
the nodes in the layer in accordance with equations (1)-(3)
previously discussed. 

Once the transfer function outputs are produced for a
layer the system enters the learning phase in which the
output gradients for the layer are determined in terms of the
output gradients of the previous layer and the output
gradients of the previous time step until the current layer
gradients are obtained. That is, in the first pass through a layer the
outputs are obtained and in the second pass the output
gradients are propagated forward. As a result the present
method eliminates the need to wait until the outputs at the
output layer are produced before weight change calculations
can occur, thereby speeding up the learning process. The
details of the learning phase will be discussed in more detail
below.
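
For a single node with local feedback only (one input, no crosstalk), the forward propagation of gradients described above reduces to a scalar recursion, which the following Python sketch illustrates. The function name `run`, the weight names `w`, `v` and `b`, the use of `math.tanh` and the numerical values are all assumptions chosen for the example; the finite difference at the end is only an independent check of the recursion.

```python
import math

def run(w, v, b, inputs):
    """Simulate x(k) = tanh(w*x(k-1) + v*u(k) + b) and carry dx/dw forward."""
    x, dx_dw = 0.0, 0.0              # output and its gradient start at zero
    for u in inputs:
        z = w * x + v * u + b
        fprime = 1.0 - math.tanh(z) ** 2
        # Gradient of the new output: the delayed gradient passes through the
        # feedback weight, the delayed output enters directly, and the sum is
        # scaled by the derivative of the transfer function.
        dx_dw = fprime * (w * dx_dw + x)
        x = math.tanh(z)
    return x, dx_dw

inputs = [0.3, -0.1, 0.4, 0.2]
x_final, grad = run(0.7, 0.5, 0.1, inputs)

# Central finite difference on w as an independent check of the recursion.
h = 1e-6
numeric = (run(0.7 + h, 0.5, 0.1, inputs)[0]
           - run(0.7 - h, 0.5, 0.1, inputs)[0]) / (2 * h)
```

The gradient is available as soon as the output of the step is computed, so no backward pass over earlier time steps is needed.
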
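
The system determines the gradients for the current layer
by first propagating 306 the gradients through the local
feedback and crosstalk connections in accordance with
equations (10) and (11):

$$S_1(l,n,i,j) = \sum_{m=1}^{N(l')} W_{[l',m][l',n]} \, \frac{\partial X_{[l',m]}(k-1)}{\partial W_{[l,j][l,i]}}, \quad \text{for } l \le l', \text{ all } n, i \text{ and } j \tag{10}$$

$$S_2(l,n,i,j) = \sum_{m=1}^{N(l')} W_{[l',m][l',n]} \, \frac{\partial X_{[l',m]}(k-1)}{\partial W_{[l-1,j][l,i]}}, \quad \text{for } l \le l', \text{ all } n, i \text{ and } j \tag{11}$$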

where $S_1$ and $S_2$ are temporary variables for storing the
effect of the weights on the gradients, l is the layer
associated with the currently considered weight, and n, i and j are
network variables. Then during the gradient calculations for
the layer the system propagates 308 the gradients forward
from the previous layer through the feed forward
connections in accordance with equations (12) and (13):

$$S_1(l,n,i,j) = S_1(l,n,i,j) + \sum_{m=1}^{N(l'-1)} W_{[l'-1,m][l',n]} \, \frac{\partial X_{[l'-1,m]}(k)}{\partial W_{[l,j][l,i]}}, \quad \text{for } l < l', \text{ all } n, i \text{ and } j \tag{12}$$

$$S_2(l,n,i,j) = S_2(l,n,i,j) + \sum_{m=1}^{N(l'-1)} W_{[l'-1,m][l',n]} \, \frac{\partial X_{[l'-1,m]}(k)}{\partial W_{[l-1,j][l,i]}}, \quad \text{for } l < l', \text{ all } n, i \text{ and } j \tag{13}$$

followed by propagating 310 the effect of the current and
prior outputs of nodes in accordance with equations (14) and
(15):

$$S_1(l,n,i,j) = S_1(l,n,i,j) + X_{[l',j]}(k-1), \quad \text{for all } n, j, \; l = l' \text{ and } i = n \tag{14}$$

$$S_2(l,n,i,j) = S_2(l,n,i,j) + X_{[l'-1,j]}(k), \quad \text{for all } n, j, \; l = l' \text{ and } i = n \tag{15}$$

and then propagating 312 the effect of the transfer function
on the gradients in accordance with equations (16) and (17):

$$\partial X^{\mathrm{new}}_{[l',n]}(k)/\partial W_{[l,j][l,i]} = S_1(l,n,i,j) \, F'_{[l']}(Z_{[l',n]}(k)), \quad \text{for all } l, n, i \text{ and } j \tag{16}$$

$$\partial X^{\mathrm{new}}_{[l',n]}(k)/\partial W_{[l-1,j][l,i]} = S_2(l,n,i,j) \, F'_{[l']}(Z_{[l',n]}(k)), \quad \text{for all } l, n, i \text{ and } j \tag{17}$$

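where $\partial X^{\mathrm{new}}/\partial W$ is the new value for the output gradient.
Next for this layer the biases are propagated within the
layer by first propagating 314 the bias through the crosstalk
and local feedback connections in accordance with equation
(18):

$$S_3(l,n,i) = \sum_{m=1}^{N(l')} W_{[l',m][l',n]} \, \frac{\partial X_{[l',m]}(k-1)}{\partial b_{[l,i]}}, \quad \text{for } l \le l' \text{ and all } n \text{ and } i \tag{18}$$

propagating 316 the biases forward in accordance with
equation (19), propagating 318 the effect of the outputs in
accordance with equation (20) and propagating 320 the
effect of the transfer function for the bias in accordance with
equation (21).

$$S_3(l,n,i) = S_3(l,n,i) + \sum_{m=1}^{N(l'-1)} W_{[l'-1,m][l',n]} \, \frac{\partial X_{[l'-1,m]}(k)}{\partial b_{[l,i]}}, \quad \text{for } l < l' \text{ and all } n \text{ and } i \tag{19}$$

$$S_3(l,n,i) = S_3(l,n,i) + 1, \quad \text{for all } n, \; l = l' \text{ and } i = n \tag{20}$$

$$\partial X^{\mathrm{new}}_{[l',n]}(k)/\partial b_{[l,i]} = S_3(l,n,i) \, F'_{[l']}(Z_{[l',n]}(k)), \quad \text{for all } l, n \text{ and } i \tag{21}$$

The system then determines 322 whether all the layers have
been processed and if not returns to perform further
processing. If all of the layers have been processed, the system
stores 324 the calculated gradients and biases for use in
future weight calculations where equations (10), (11) and
(18) have taken prior gradients into account.

$$\partial X_{[l',n]}(k)/\partial W_{[l,j][l,i]} = \partial X^{\mathrm{new}}_{[l',n]}(k)/\partial W_{[l,j][l,i]} \tag{22}$$

$$\partial X_{[l',n]}(k)/\partial W_{[l-1,j][l,i]} = \partial X^{\mathrm{new}}_{[l',n]}(k)/\partial W_{[l-1,j][l,i]} \tag{23}$$

$$\partial X_{[l',n]}(k)/\partial b_{[l,i]} = \partial X^{\mathrm{new}}_{[l',n]}(k)/\partial b_{[l,i]}, \quad \text{for all } l, l', n, i \text{ and } j \tag{24}$$

This step results in dynamic learning.

Next the system calculates the weights for the nodes in the
layer for the feed forward (equation 25), local feedback and
crosstalk connections (equation 26) and updates the bias in
accordance with equation (27).

$$W_{[l-1,j][l,i]} = W_{[l-1,j][l,i]} - \eta \sum_{n=1}^{N(L)} \left( X_{[L,n]}(k) - Y_n(k) \right) \left( \partial X_{[L,n]}(k)/\partial W_{[l-1,j][l,i]} \right), \quad \text{for all } l, i \text{ and } j \tag{25}$$

$$W_{[l,j][l,i]} = W_{[l,j][l,i]} - \eta \sum_{n=1}^{N(L)} \left( X_{[L,n]}(k) - Y_n(k) \right) \left( \partial X_{[L,n]}(k)/\partial W_{[l,j][l,i]} \right), \quad \text{for all } l, i \text{ and } j \tag{26}$$

$$b_{[l,i]} = b_{[l,i]} - \eta \sum_{n=1}^{N(L)} \left( X_{[L,n]}(k) - Y_n(k) \right) \left( \partial X_{[L,n]}(k)/\partial b_{[l,i]} \right), \quad \text{for all } l \text{ and } i \tag{27}$$

The system then updates 332 the error in accordance with
equation (4). The system then performs the same loop
control functions 336-346 as performed in the embodiment
described with respect to FIG. 8.

The present invention, as illustrated in FIG. 9, is capable
of on-line or off-line supervised learning in which test sets
are presented to the network.

Occasionally, a neural network converging toward a
global optimal solution will get trapped in a local optimum and
in such a situation the present invention can be augmented
by a step which forces the network out of a local minimum.
In addition, this step results in significant learning
acceleration. This is attained by normalizing 360 the weights and
biases as illustrated in FIG. 10 in accordance with equations
(28)-(30).

$$\|\partial E/\partial r\|^2 = \sum_{l=1}^{L} \sum_{l'=1}^{L} \sum_{i=1}^{N(l')} \sum_{j=1}^{N(l)} \left( \partial E/\partial W_{[l,j][l',i]} \right)^2 + \sum_{l=1}^{L} \sum_{i=1}^{N(l)} \left( \partial E/\partial b_{[l,i]} \right)^2 \tag{28}$$

$$W(\text{new}) = W(\text{old}) - \eta(E, \partial E/\partial r, \partial^2 E/\partial r^2, \ldots, \partial^p E/\partial r^p) \, \frac{\partial E/\partial W}{\|\partial E/\partial r\|^q} \tag{29}$$

$$b(\text{new}) = b(\text{old}) - \eta(E, \partial E/\partial r, \partial^2 E/\partial r^2, \ldots, \partial^p E/\partial r^p) \, \frac{\partial E/\partial b}{\|\partial E/\partial r\|^q} \tag{30}$$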

where W is any weight in the network, b is any bias in the
network, r is any weight or bias in the network, p is the order
of the highest moment of the error gradient used in the
learning algorithm defined by the learning algorithm chosen,
where in the current situation p = 1, q is the power to which
the norm of the error gradient is raised, where in the current
situation q = 2, and $\eta(E, \partial E/\partial r, \partial^2 E/\partial r^2, \ldots, \partial^p E/\partial r^p)$ is a positive
function of E, such as $\eta(E) = \rho E \, \partial E/\partial r$ where $\rho$ is a constant
such as 0.01. Other functions can be used such as an
exponential or hyperbolic tangent. This functional
dependence of $\eta$ allows the learning rate to change adaptively with
the error E. This method of updating adjusts the learning rate
allowing faster convergence to the global optimum with a
reasonable number of iterations without adding unnecessary
tuning parameters. During operation as the solution
approaches a local minimum, because the error does not
approach zero, the ratio of the error gradient to the norm
squared of the error gradient will approach infinity causing
the weights to jump in magnitude thus moving away from
the local minimum. However, as the global minimum is
approached the error function will approach zero resulting in
a smooth convergence of the weights to their optimum
values.
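
The normalized update of equations (28)-(30) might be sketched in Python as follows. This is only an illustration under assumptions: the function name `normalized_update` and the parameter `eta_fn` (the learning rate function, passed in by the caller) are hypothetical, the guard for a zero gradient norm is added for numerical safety and is not part of the description, and the quadratic error and the choice `0.5 * e` in the usage lines are made up for the example.

```python
def normalized_update(params, grads, error, eta_fn, q=2):
    """Return params - eta(E) * grad / ||grad||**q for all weights and biases."""
    norm_sq = sum(g * g for g in grads)          # equation (28)
    if norm_sq == 0.0:
        return list(params)                      # assumed guard: no gradient
    scale = eta_fn(error) / norm_sq ** (q / 2.0)
    # Equations (29) and (30): the same rule is applied to weights and biases.
    return [p - scale * g for p, g in zip(params, grads)]

# Usage on a toy error E = 0.5 * sum((r - t)**2), whose gradient is r - t.
target = [1.0, -2.0]
r = [0.0, 0.0]
for _ in range(50):
    grads = [ri - ti for ri, ti in zip(r, target)]
    error = 0.5 * sum(g * g for g in grads)
    r = normalized_update(r, grads, error, eta_fn=lambda e: 0.5 * e)
```

In this toy case the error and the squared gradient norm shrink together, so the step stays proportional to the gradient and the parameters settle smoothly; when the gradient norm shrinks while the error does not, the same rule produces the large jump described above.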

The above-described acceleration technique is not limited
to the architecture or the learning methods disclosed herein
and is applicable to other architectures, such as a feed
forward architecture, and to any learning method, such as a
back propagation method. In the general case, the weights
and the biases are updated according to equations (29) and
(30), where $\|\partial E/\partial r\|^2$ means the sum of the squares of $\partial E/\partial w$
for all weights w and the squares of $\partial E/\partial b$ for all biases b in
the network.

The present invention can be used to identify or model the
behavior of dynamic non-linear systems, thereby allowing
real time diagnostic and predictive control of large scale
complex non-linear dynamic systems, such as power plants.
Multiple input multiple output (MIMO) dynamic non-linear
systems can be used to model real world MIMO systems
such as a process control plant as illustrated in FIG. 11. In
such a system, the system inputs to the actual plant 400 are
also provided to the present invention 402 and applied to the
neural network model 404 as represented by equations
(1)-(3). The system determines 406 the error difference
between the predicted outputs (equation (4)) and the actual
outputs and this is used to change the model parameters
using equations (22)-(24) or (7)-(9).

A representative MIMO nonlinear system can be 
expressed by the following state and output equations 
(31)-(35): 

x_1(k) = 0.5 [x_1(k-1)]^{1/2} + 0.3 x_2(k-1) x_3(k-1) + 0.5 u_1(k)    (31) 

x_2(k) = 0.5 [x_2(k-1)]^{1/2} + 0.3 x_3(k-1) x_1(k-1) + 0.5 u_1(k)    (32) 


P#9rf= (28) 

1=1 F =1 7=1 

xTT (dww m ,j ] ) 2 + I. T @Eab mm ) 2 

1=1 r=l t=1 1=1 i=l 

«W) = W(old) - T\(E,dE/dr,d 2 E/dr 2 , . . . , E/d/ydE/dJ (29) 



x_3(k) = 0.5 [x_3(k-1)]^{1/2} + 0.3 x_1(k-1) x_2(k-1) + 0.5 u_2(k)    (33) 

y_1(k) = 0.5 (x_1(k) + x_2(k) + x_3(k))    (34) 

y_2(k) = 2 x_1^2(k)    (35) 


The network 404 used to model this system included an 
input layer with 2 nodes, 2 hidden layers with 12 and 10 
nodes, respectively, and an output layer with 2 nodes 
(2-12-10-2). The initial training set consisted of all 25 possible 
combinations of steps with magnitudes 0.125, 0.25, 0.375, and 
0.5, as well as the zero input, for both input channels. Each 
signal contained 15 data points. The network was trained for 
800 cycles, where one cycle (iteration) consisted of one 
presentation of the above training set, using a 0.01 learning 
rate for both the weights and the biases. 
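
The composition of the initial training set can be sketched in Python as follows. The sketch enumerates the five levels (zero plus the four step magnitudes) on each of the two input channels, which gives the 25 combinations mentioned above. It assumes each step is applied from the first sample onward, since the step onset time is not stated; the names `LEVELS`, `POINTS_PER_SIGNAL`, and `build_training_set` are hypothetical.

```python
from itertools import product

LEVELS = [0.0, 0.125, 0.25, 0.375, 0.5]  # zero input plus the four step magnitudes
POINTS_PER_SIGNAL = 15                   # data points per training signal


def build_training_set():
    """Return one sequence of (u1, u2) samples for every pair of step levels.

    Assumption: each step starts at the first sample and is held constant.
    """
    signals = []
    for a1, a2 in product(LEVELS, repeat=2):
        signals.append([(a1, a2)] * POINTS_PER_SIGNAL)
    return signals


training_set = build_training_set()
```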

The pulse inputs used to test the identified network are 
shown in FIG. 12. However, the network response to these 
pulses was not satisfactory. Initially, the network followed 
the model closely; however, once the pulses ceased, 
the network and the analytic model reached different 
equilibrium values. The reason for such behavior turned out to be 
the two equilibria of the unforced system under 
consideration. Specifically, (y_1=0, y_2=0) and (y_1=0.18, y_2=0.45) are 
the equilibria of the unforced system. The initial training set, 
however, did not contain any information about the second 
equilibrium point. Therefore, 5 more pulses of suitable 
magnitude, each containing 40 data points, were included in 
the initial training set to account for the observed second 
equilibrium point. Training was continued with the 
augmented set for 100 more iterations using the same learning 
rate as before. This approach indicates that the network is 
capable of incremental learning for complex systems. That 
is, initially, a small but representative set of input signals can 
be considered in the training set. If the network is not 
capable of generalizing from this limited set of signals, 
additional training is performed using the network obtained 
at the end of the previous learning session, with a training set 
enhanced with more representative signals. 
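
The two rest points of the unforced system can be checked numerically with a short Python sketch. It assumes that each state equation reads x_i(k) = 0.5 [x_i(k-1)]^{1/2} + 0.3 x_j(k-1) x_l(k-1) plus an input term, and that the outputs are y_1 = 0.5 (x_1 + x_2 + x_3) and y_2 = 2 x_1^2; the output equation for y_1 in particular is a reading of damaged text and should be treated as an assumption. The function names and the starting states are hypothetical.

```python
import math


def unforced_step(x):
    """Advance the three states by one time step with both inputs held at zero."""
    x1, x2, x3 = x
    return (
        0.5 * math.sqrt(x1) + 0.3 * x2 * x3,
        0.5 * math.sqrt(x2) + 0.3 * x3 * x1,
        0.5 * math.sqrt(x3) + 0.3 * x1 * x2,
    )


def outputs(x):
    """Outputs under the assumed reading y1 = 0.5*(x1+x2+x3), y2 = 2*x1^2."""
    x1, x2, x3 = x
    return 0.5 * (x1 + x2 + x3), 2.0 * x1 ** 2


def settle(x0, steps=200):
    """Iterate the unforced map long enough for the states to settle."""
    x = tuple(x0)
    for _ in range(steps):
        x = unforced_step(x)
    return x


rest = outputs(settle((0.0, 0.0, 0.0)))       # system never excited
excited = outputs(settle((0.1, 0.2, 0.05)))   # states left nonzero after a pulse
```

Under these assumptions a system that starts at rest stays at zero, while any positive initial state settles at a second rest point whose output values are close to 0.45 and 0.18, the same pair quoted above. Which value attaches to y_1 and which to y_2 depends on the assumed reading of the output equations.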

Following training with the augmented input set, three 
tests were performed with signals unknown to the network 
to investigate its performance. The network and analytic 
model responses to the inputs depicted in FIG. 12 are shown 
in FIG. 13. The calculated RMS prediction error for this 
signal was 4.8%. The second test signal was 

u_1(k) = 0.3 + 0.2 sin(…)    (36) 

u_2(k) = 0.2,    (37) 

where the step input in the second channel was delayed by 
5 time steps. The network and analytic model responses to 
this input set are depicted in FIG. 14, with the RMS 
prediction error evaluated at 4.8%. The final test set 
consisted of a step augmented with zero mean white Gaussian 
noise of 0.1 standard deviation. The magnitudes for the steps 
were 0.3 and 0.2 for the two input channels, respectively. 
FIG. 15 presents the network and analytic model responses 
to this test signal, with a calculated RMS prediction error of 
2.1%. 
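
The final test signal and an RMS error measure can be sketched in Python as follows. The text does not state how the RMS prediction error percentage is normalized, so the sketch assumes the RMS of the error divided by the RMS of the reference signal. The signal length of 100 points, the seeds, and the function names are likewise hypothetical.

```python
import math
import random


def noisy_step(magnitude, n_points, sigma=0.1, seed=0):
    """Step of the given magnitude plus zero-mean Gaussian noise of std sigma."""
    rng = random.Random(seed)
    return [magnitude + rng.gauss(0.0, sigma) for _ in range(n_points)]


def rms_percent_error(predicted, reference):
    """RMS of the error over RMS of the reference, in percent (assumed definition)."""
    err_sq = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    ref_sq = sum(r ** 2 for r in reference)
    return 100.0 * math.sqrt(err_sq / ref_sq)


# Step magnitudes 0.3 and 0.2 for the two input channels, noise std 0.1.
u1 = noisy_step(0.3, 100, seed=1)
u2 = noisy_step(0.2, 100, seed=2)
```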

The present invention, as illustrated in FIG. 16, can also 
be used to identify hidden parameters or states in non-linear 
systems. The estimation of hidden parameters and states 
involves inferring, from measured variables, unobservable 
variables or non-measurable parameters, for example, system 
states which cannot be measured or system parameters which 
are not known or are varying. In such a situation the 
training set applied to the present invention includes the 
actual state of the unknown parameter at the time the data 
was taken. For example, when estimating the inertial 
characteristics of a spacecraft, an input to the invention would be 
the inertia matrix of the spacecraft for a geometrically 
known configuration. This inertia matrix would have been 
calculated a priori using the known spacecraft configuration. 
The actual inputs to the plant are also provided to the model 
404, and the model 404 produces not only output estimates 
equivalent to the actual outputs for error estimation purposes 
but also an output which corresponds to the unobserved state 
of the parameter. The comparison of this extra output with 
the actual state of the parameter is included in the error 
determination operation during training but not during actual 
use. 
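
The difference between the error used during training and the error used during operation can be sketched in Python as follows. The half-sum-of-squares error form and all function names are assumptions made for illustration; the sketch only shows that the known hidden value contributes to the error during training and is dropped during actual use.

```python
def half_squared_error(predicted, actual):
    """Half the sum of squared differences (assumed error form)."""
    return 0.5 * sum((p - a) ** 2 for p, a in zip(predicted, actual))


def training_error(y_pred, y_meas, hidden_pred, hidden_known):
    """Training: measured outputs and the known hidden value both contribute."""
    return (half_squared_error(y_pred, y_meas)
            + half_squared_error(hidden_pred, hidden_known))


def operating_error(y_pred, y_meas):
    """Actual use: only the measured outputs are available for comparison."""
    return half_squared_error(y_pred, y_meas)
```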

As illustrated in FIG. 17 the present invention can also be 
used to predict time series systems, such as the stock market, 
by applying the inputs received by the actual time series 
system to the present invention and developing a time series 
model in a manner similar to that discussed with respect to 
FIG. 11. This system is taught and used in a similar manner 
as discussed with respect to FIG. 11. 

Furthermore, as illustrated in FIG. 18 the present inven- 
tion can be used to control non-linear dynamic systems, such 
as spacecraft systems or process systems by developing a 
model 402 of the plant as discussed with respect to FIG. 11 
and then using conventional learning rules system 408 to 
adapt the adjustable parameters of a conventional controller 
414 (such as a proportional-integral or state feedback con- 
troller) or a neurocontroller. In this embodiment a switch 
416 is used to initiate or terminate learning. The switching 
is controlled by the system operator or by control software 
and the reference in FIG. 18 represents the desired plant 
operating set-point or trajectory. 

The many features and advantages of the invention are 
apparent from the detailed specification and thus it is 
intended by the appended claims to cover all such features 
and advantages of the invention which fall within the true 
spirit and scope of the invention. Further, since numerous 
modifications and changes will readily occur to those skilled 
in the art, it is not desired to limit the invention to the exact 
construction and operation illustrated and described, and 
accordingly all suitable modifications and equivalents may 
be resorted to, falling within the scope of the invention. The 
node and network of the present invention have been described 
using a gradient descent learning method; however, other 
methods can be used. 

What is claimed is: 

1. A neural network system, comprising: 

a hidden layer with nodes each transforming input signals 
into an output signal using weights for weighting and a 
memory for storing signals; and 

teaching means for forward propagating error gradients in 
the nodes and normalizing weights responsive to an 
error. 

2. A method of teaching a neural network having inter- 
connected layers, comprising the steps of: 

(a) applying test input signals to the network; 

(b) producing, by a first of the layers, first output signals 
by transforming the input signals into the first output 
signals using a memory for storing intermediate sig- 
nals; 

(c) calculating weights for the first of the layers by 
performing dynamic recurrent gradient descent learn- 
ing; and 

(d) producing, by a second of the layers after step (c), 
second output signals from the first output signals. 

3. A method of teaching a neural network having inter- 
connected layers, comprising the steps of: 

(a) applying test input signals to the network; 

(b) producing, by a first of the layers, first output signals 
by transforming the input signals into the first output 
signals using a memory for storing intermediate sig- 
nals; 

(c) calculating weights for the first of the layers by 
calculating the weights from prior gradients; and 

(d) producing, by a second of the layers after step (c), 
second output signals from the first output signals. 

4. A method of teaching a neural network having inter- 
connected layers, comprising the steps of: 

(a) applying test input signals to the network; 

(b) producing, by a first of the layers, first output signals 
by transforming the input signals into the first output 
signals using a memory for storing intermediate sig- 
nals; 

(c) calculating weights for the first of the layers by 
normalizing the weights responsive to an error; and 

(d) producing, by a second of the layers after step (c), 
second output signals from the first output signals. 

5. A method of teaching a neural network having inter- 
connected layers, comprising the steps of: 

(a) applying the test input signals to the network; 

(b) producing, by a first of the layers, first output signals 
by transforming the input signals into the first output 
signals using a memory for storing intermediate sig- 
nals; 

(c) calculating weights for the first of the layers; and 

(d) producing, by a second of the layers after step (c), 
second output signals from the first output signals 
wherein the network includes local feedback and 
crosstalk and step (b) includes producing the first 
output signals using the local feedback and crosstalk 
and step (c) includes calculating the weights using the 
local feedback and crosstalk. 

6. A method of performing neural network processing for 
interconnected layers, comprising the steps of: 

(a) feeding output signals forward from a first hidden 
layer to a second hidden layer; and 

(b) calculating output signals for the second hidden layer 
by transforming the output signals from the first hidden 
layer, delayed crosstalk output signals from the second 
hidden layer and delayed local feedback output signals 
from the second hidden layer stored in a local feedback 
memory into the output signals. 

7. A method as recited in claim 6, further comprising: 

(c) updating gradients for the second hidden layer 
immediately after step (b). 